Key outcomes

To do’s

Questions

First description

table(alltools$method)
## 
##        diartk           ldc   openSat_Sum openSat_noSum 
##           976          1081           978           978
summary(alltools$DER)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   65.93   90.93  106.38  110.60 3298.96
alltools$DER[alltools$DER>100]<-NA

summary(alltools$B3F1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2252  0.5482  0.6243  0.6192  0.6894  1.0000
summary(alltools$MI)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.008839 0.048186 0.122271 0.164402 1.551829

LDC analyzed 1081 segments, whereas the two openSAT runs each managed 978. The slightly lower count for diartk (976) probably reflects the fact that only segments containing some speech get analyzed.

As usual, DER returns some ridiculous values. Since DER is a rate, it should range from 0 to 100. We set values above 100 to NA, which affects 34% of the values.
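The censoring step can be checked directly; a small base-R sketch on made-up values (the `der` vector is hypothetical, not the real data) showing how to compute the share of values set to NA:

```r
# Toy DER values; anything above 100 is treated as invalid and censored.
der <- c(0, 45.2, 90.9, 150.7, 3298.96, 88.1, 101.5, 60.0)

censored <- der > 100            # logical flag for out-of-range values
prop_censored <- mean(censored)  # share of values that will become NA

der[censored] <- NA              # same censoring as in the analysis above
```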

There is nothing to be said about B3 F1: it ranges from 0 to 1, as it should.

I don’t know enough about MI to say much about it, except that there seem to be some outlier values.
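One quick way to flag the apparent MI outliers is the usual 1.5 × IQR boxplot rule; a base-R sketch on hypothetical values (including one extreme point like the 1.55 maximum above):

```r
# Hypothetical MI values, not the real data.
mi <- c(0.00, 0.01, 0.05, 0.12, 0.16, 0.20, 1.55)

q <- quantile(mi, c(0.25, 0.75))
iqr <- q[2] - q[1]
upper <- q[2] + 1.5 * iqr      # standard boxplot whisker limit
outliers <- mi[mi > upper]     # values flagged as outliers
```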

Speech activity detection

Three options: openSAT no sum, openSAT sum, LDC

overall performance

direct comparison for individual recordings

TO DO: line names have the clip ID, but they look odd…

Statistical comparison

The following regression does not take repeated measures into account; doing so could change the results, either making them starker or removing significance. It is unlikely, however, to change the fact that there is a slight trend towards lower performance for openSAT than for LDC.
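For the repeated-measures version, a per-clip random intercept is the natural fix. A sketch on simulated data (the column names mirror `sad`, but the values here are made up; assumes the lme4 package is available):

```r
set.seed(1)
n_clips <- 50
toy <- data.frame(
  clip   = factor(rep(seq_len(n_clips), times = 3)),
  method = factor(rep(c("ldc", "openSat_Sum", "openSat_noSum"), each = n_clips))
)
clip_effect <- rnorm(n_clips, 0, 0.05)               # per-clip baseline shift
toy$B3F1 <- 0.68 + clip_effect[as.integer(toy$clip)] +
  ifelse(toy$method == "ldc", 0, -0.03) +            # small simulated openSAT penalty
  rnorm(nrow(toy), 0, 0.02)

# Random intercept per clip instead of the plain lm() used above.
if (requireNamespace("lme4", quietly = TRUE)) {
  fit <- lme4::lmer(B3F1 ~ method + (1 | clip), data = toy)
  print(lme4::fixef(fit))   # fixed-effect estimates for method
}
```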

## 
## Call:
## lm(formula = B3F1 ~ method, data = sad)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.17397 -0.06646 -0.01473  0.04707  0.35028 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.680258   0.002881 236.096  < 2e-16 ***
## methodopenSat_Sum   -0.030538   0.004181  -7.305 3.54e-13 ***
## methodopenSat_noSum -0.027270   0.004181  -6.523 8.05e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09473 on 3034 degrees of freedom
## Multiple R-squared:  0.0211, Adjusted R-squared:  0.02045 
## F-statistic:  32.7 on 2 and 3034 DF,  p-value: 8.926e-15
## 
## Call:
## lm(formula = DER ~ method, data = sad)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -79.259 -12.925   1.285  13.680  56.095 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          43.9047     0.6639   66.13   <2e-16 ***
## methodopenSat_Sum    34.9898     1.0642   32.88   <2e-16 ***
## methodopenSat_noSum  35.3542     1.0402   33.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.12 on 2131 degrees of freedom
##   (903 observations deleted due to missingness)
## Multiple R-squared:  0.4288, Adjusted R-squared:  0.4283 
## F-statistic: 799.9 on 2 and 2131 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = MI ~ method, data = sad)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.10268 -0.03498 -0.02294  0.01983  0.73191 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.102682   0.002353   43.64   <2e-16 ***
## methodopenSat_Sum   -0.063762   0.003414  -18.68   <2e-16 ***
## methodopenSat_noSum -0.067699   0.003414  -19.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07736 on 3034 degrees of freedom
## Multiple R-squared:  0.1424, Adjusted R-squared:  0.1419 
## F-statistic:   252 on 2 and 3034 DF,  p-value: < 2.2e-16

Next, a direct comparison between the two openSAT variants. The differences are not significant, and numerically very small. Taking repeated measures into account would be important here.
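A paired, per-clip comparison is the simplest way to respect the repeated measures for this contrast. A base-R sketch on simulated data (the column names mirror `sad`, but the values are made up):

```r
set.seed(2)
clips <- paste0("clip", 1:40)
base  <- rnorm(40, 0.65, 0.05)              # per-clip difficulty
wide  <- data.frame(
  clip  = clips,
  sum   = base + rnorm(40, 0, 0.01),        # hypothetical openSat_Sum scores
  nosum = base + rnorm(40, 0, 0.01)         # hypothetical openSat_noSum scores
)

# Paired t-test on within-clip differences: the per-clip baseline cancels out.
res <- t.test(wide$sum, wide$nosum, paired = TRUE)
res$p.value
```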

## 
##  Welch Two Sample t-test
## 
## data:  B3F1 by method
## t = -0.76969, df = 1948.9, p-value = 0.4416
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.011594640  0.005058772
## sample estimates:
##   mean in group openSat_Sum mean in group openSat_noSum 
##                   0.6497200                   0.6529879
## 
##  Welch Two Sample t-test
## 
## data:  DER by method
## t = -0.38522, df = 1207.2, p-value = 0.7001
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.220224  1.491444
## sample estimates:
##   mean in group openSat_Sum mean in group openSat_noSum 
##                    78.89456                    79.25895
## 
##  Welch Two Sample t-test
## 
## data:  MI by method
## t = 1.3853, df = 1952, p-value = 0.1661
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.001636617  0.009511004
## sample estimates:
##   mean in group openSat_Sum mean in group openSat_noSum 
##                  0.03891989                  0.03498270

precision against recall

Diarization

This dataset is not ideal for diarization because the labels C1 and C2, F2 and F3, and M1 and M2 are interchangeable. Nonetheless, for what it’s worth, here are the results.

Old results

methods=dir("old_res/",pattern="txt")
oldres=NULL
for(method in methods){
  read.table(paste0("old_res/",method),header=F,skip=1)->x
  oldres=rbind(oldres,cbind(method,x))
}
names(oldres)<-c("method","clip","prec","rec","f1")
summary(oldres)
##                        method                           clip     
##  score_ldc.txt            :1084   aiku_20160714_12780.rttm:   6  
##  score_openSAT_1+2.txt    :1018   aiku_20160714_16380.rttm:   6  
##  score_openSAT_12.txt     : 979   aiku_20160714_1980.rttm :   6  
##  score_openSAT_1234.txt   : 979   aiku_20160714_19980.rttm:   6  
##  score_openSAT_123478.txt :1083   aiku_20160714_27180.rttm:   6  
##  score_openSAT_1234789.txt:1083   aiku_20160714_30780.rttm:   6  
##                                   (Other)                 :6190  
##       prec             rec               f1       
##  Min.   :0.5000   Min.   :0.5000   Min.   :0.500  
##  1st Qu.:0.5800   1st Qu.:0.5800   1st Qu.:0.620  
##  Median :0.6700   Median :0.6800   Median :0.670  
##  Mean   :0.7011   Mean   :0.7021   Mean   :0.689  
##  3rd Qu.:0.8000   3rd Qu.:0.8100   3rd Qu.:0.740  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.000  
## 

For LDC, there are 3 more clips in the old results than in the latest batch.

For the openSAT runs, there is 1 more clip in the old results than in the latest batch when considering 12 — but there are many more when considering 1+2 and 123478.

Why are there different N’s for 12, 1+2 and 123478??
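The differing N’s could be tracked down by comparing the clip sets directly. A sketch with toy clip lists (in the real data these would be `oldres$clip` subset by method):

```r
# Hypothetical clip lists for two scoring runs.
clips_12  <- c("a.rttm", "b.rttm", "c.rttm")
clips_1p2 <- c("a.rttm", "b.rttm", "c.rttm", "d.rttm")

setdiff(clips_1p2, clips_12)   # clips scored by 1+2 but not by 12
setdiff(clips_12, clips_1p2)   # and the reverse
```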

t.test(oldres$f1, sad$B3F1[sad$method=="ldc"])
## 
##  Welch Two Sample t-test
## 
## data:  oldres$f1 and sad$B3F1[sad$method == "ldc"]
## t = 2.747, df = 1512.9, p-value = 0.006086
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.002505028 0.015017034
## sample estimates:
## mean of x mean of y 
## 0.6890186 0.6802576

Strangely, LDC yielded better results before than it does in the latest batch…

Comparisons across systems

# NB: (1/clip) does not model repeated measures here; lm() effectively
# ignores it. A per-clip random intercept would need lme4::lmer(f1 ~ method + (1|clip)).
summary(lm(f1~method+(1/clip),data=oldres))
## 
## Call:
## lm(formula = f1 ~ method + (1/clip), data = oldres)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.20631 -0.06340 -0.02061  0.04369  0.34939 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      0.678044   0.002876 235.797  < 2e-16 ***
## methodscore_openSAT_1+2.txt      0.038262   0.004132   9.260  < 2e-16 ***
## methodscore_openSAT_12.txt       0.045347   0.004174  10.863  < 2e-16 ***
## methodscore_openSAT_1234.txt     0.045357   0.004174  10.866  < 2e-16 ***
## methodscore_openSAT_123478.txt  -0.027435   0.004068  -6.745 1.67e-11 ***
## methodscore_openSAT_1234789.txt -0.027435   0.004068  -6.745 1.67e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09468 on 6220 degrees of freedom
## Multiple R-squared:  0.1029, Adjusted R-squared:  0.1022 
## F-statistic: 142.8 on 5 and 6220 DF,  p-value: < 2.2e-16

The best is _12, which is indistinguishable from _1234; it is a little odd that 123478 and 1234789 give exactly the same numbers. Notice that here openSAT outperforms LDC!
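Whether 123478 and 1234789 really produced identical scores can be checked clip by clip. A sketch, assuming the `oldres` columns shown above (toy values here):

```r
# Toy stand-in for two methods' f1 scores, keyed by clip.
a <- data.frame(clip = c("x", "y", "z"), f1 = c(0.62, 0.71, 0.55))
b <- data.frame(clip = c("x", "y", "z"), f1 = c(0.62, 0.71, 0.55))

m <- merge(a, b, by = "clip", suffixes = c(".78", ".789"))
all(m$f1.78 == m$f1.789)   # TRUE iff scores match on every shared clip
```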

Source of variability in performance

Ideas for what might be harder:

read.table("../derivedFiles/line_per_segment_age.txt",header=T)->human
human$recstart=ifelse(human$recn==1,0,15*60*60)
human$clip=paste0(human$child,"_",substr(human$date,1,4),
                  substr(human$date,6,7),
                  substr(human$date,9,10),"_",
                  human$recstart + human$chunkstart+180
                  )

length(levels(factor(human$File)))
## [1] 1573
length(levels(factor(human$clip))) #the number we should end up with
## [1] 1439
aggregate(human$dur,by=list(human$clip,human$speakerID),sum)->sums
sums[sums$Group.2=="Noise","Group.1"]->withnoise
length(withnoise) #N of clips with noted noise
## [1] 345
aggregate(human$age,by=list(human$clip),mean)->age
names(age)<-c("clip","age")
dim(age) #N ok
## [1] 1439    2
aggregate(human$dur[human$type==0],by=list(human$clip[human$type==0]),sum)->durnonling
names(durnonling)<-c("clip","durnonling")
dim(durnonling)
## [1] 417   2
merge(age,durnonling,by="clip",all=T)->mytab
mytab$durnonling[is.na(mytab$durnonling)]<-0
data.frame(table(human$clip))->nsegs
names(nsegs)<-c("clip","nsegs")
merge(mytab,nsegs, all=T)->mytab

mytab$withnoise<-ifelse(mytab$clip %in% withnoise,1,0)
dim(mytab)
## [1] 1439    5
tocomp$clip=gsub(".rttm","",tocomp$clip,fixed=TRUE) # fixed=TRUE so the dot is literal, not a regex wildcard
merge(tocomp,mytab,all=T)->x
dim(x)
## [1] 1490   13
lm(f1.ldc ~ age+durnonling+nsegs+withnoise,data=x)->mylm
plot(mylm)

zscore=function(x) (x-mean(x, na.rm=T))/sd(x,na.rm=T)
x$age.z=zscore(x$age)
x$nsegs.z=zscore(x$nsegs)
x$durnonling.z=zscore(x$durnonling)

lm(f1.ldc ~ age.z+durnonling.z+nsegs.z+withnoise,data=x)->mylm2
plot(mylm2)

# NB: subset() is misused here; lm()'s subset argument expects a logical
# vector, e.g. subset = age.z < 3 & durnonling.z < 3 & nsegs.z < 3.
lm(f1.ldc ~ age.z+durnonling.z+nsegs.z+withnoise,data=x,subset(age.z<3,durnonling.z<3,nsegs.z<3))->mylm3
plot(mylm3)

summary(mylm3)
## 
## Call:
## lm(formula = f1.ldc ~ age.z + durnonling.z + nsegs.z + withnoise, 
##     data = x, subset = subset(age.z < 3, durnonling.z < 3, nsegs.z < 
##         3))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.17162 -0.06257 -0.01631  0.04874  0.31936 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.679113   0.003817 177.938   <2e-16 ***
## age.z         0.001145   0.002972   0.385   0.7001    
## durnonling.z -0.002899   0.002934  -0.988   0.3234    
## nsegs.z       0.003959   0.004664   0.849   0.3961    
## withnoise    -0.016031   0.006697  -2.394   0.0169 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09175 on 923 degrees of freedom
##   (562 observations deleted due to missingness)
## Multiple R-squared:  0.007652,   Adjusted R-squared:  0.003352 
## F-statistic: 1.779 on 4 and 923 DF,  p-value: 0.1308

Interpret carefully: it looks like there are a few points with too much impact. In any case, the regression overall is not significant, and it explains a minute portion of the variance (less than 1%). The only significant predictor is whether there is background noise.
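The “points with too much impact” can be quantified with Cook’s distance on the fitted model; in the real analysis this would be `cooks.distance(mylm3)`. A self-contained base-R sketch on simulated data with one planted outlier:

```r
set.seed(3)
n <- 100
toy <- data.frame(x = rnorm(n))
toy$y <- 0.68 + 0.01 * toy$x + rnorm(n, 0, 0.09)
toy$y[1] <- 2                       # plant one influential outlier

fit <- lm(y ~ x, data = toy)
cd  <- cooks.distance(fit)

# The usual rule of thumb flags points with distance above 4/n.
influential <- which(cd > 4 / n)
```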

plot(x$f1.ldc~x$nsegs,pch=20,xlim=c(0,55))
abline(lm(x$f1.ldc[x$nsegs<55]~x$nsegs[x$nsegs<55]),col="red")

boxplot(x$f1.ldc~x$withnoise,main="Performance as a function of whether there is background noise")